-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(rust): Use ObjectStore instead of AsyncRead in parquet get metadata #15069
refactor(rust): Use ObjectStore instead of AsyncRead in parquet get metadata #15069
Conversation
7f2815e
to
e123b3b
Compare
e123b3b
to
66fe8d7
Compare
baa9d56
to
860ec33
Compare
} | ||
|
||
/// Asynchronously reads the files' metadata | ||
pub async fn fetch_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html provides an implementation of this which also handles prefetch and loading page indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip @tustvold. @mickvangelderen can you take a look at that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can see, that implementation is similar. It allows passing a footer size hint to hopefully save one round trip and has tests that also check the number of fetches.
@ritchie46 what do we want to do with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ritchie46 what do we want to do with that?
In what sense? The footer size or the tests? I would favor us the be conservatively here. Rather a larger request but have one, then needing two calls. But that's my gut feeling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For that to make sense we need to have a good idea about how large the footer metadata is in the common case. We are making a trade-off between latency and throughput. If we request a large amount of data so that the footer metadata is likely contained within it, we are possibly waiting to download data that we are going to throw away again (the part that we are downloading but do not use). If we request too few bytes we are affected by both throughput and latency. If we request exactly enough bytes we are affected by latency but not the throughput for bytes that we throw away.
What is most efficient depends on the latency, throughput and footer byte size.
Having to read files backwards is not great. It would be nice if file formats optimize for reading rather than writing. If we take one step back, the real solution is to use a file format that can be read efficiently over a network connection.
Anyway, I am happy to reimplement the optimistic fetch. What should the pre-fetch size be? Should it be configurable (env or parameter)?
FYI there's work in flight to remove the final use of these traits in ObjectStore, I'd be interested in any thoughts on this - apache/arrow-rs#5500 |
I don't. It is really not clear how many calls that will lead to. As it would work for any generic |
@ion-elgreco had a problem, but they are hard to reproduce locally. If we have this PR ready we could ask him to take it for a spin. |
I guess Seems like a nice PR, improving features while removing more code than you're adding is often a good sign. |
@ritchie46 let me know when I can try it out :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great Mick!
} | ||
|
||
/// Asynchronously reads the files' metadata | ||
pub async fn fetch_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ritchie46 what do we want to do with that?
In what sense? The footer size or the tests? I would favor us the be conservatively here. Rather a larger request but have one, then needing two calls. But that's my gut feeling.
Fixes #15032
For the record, I do like the idea of the
AsyncRead
andAsyncSeek
andAsyncWrite
abstraction. We just want to eliminate the possibility of the implementation causing network connection resets. Maybe I have to split the object store deduplication from this PR.@ritchie46 is there any way we can determine whether this solves #14384?